AllLife Bank Project

Description

Background and Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify the potential customers who have a higher probability of purchasing the loan.

Objective

  1. To predict whether a liability customer will buy a personal loan or not.
  2. To identify which variables are most significant.
  3. To identify which segment of customers should be targeted more.

Data Dictionary

Load Libraries

Load Data

Observation

  1. Personal_Loan, Securities_Account, CD_Account, Online and CreditCard are int64 columns holding 1 and 0, which indicate YES and NO respectively.

Observation

There are no missing values in the data

Observation

  1. Several numeric variables appear skewed; the univariate plots of Income and CCAvg show right skew.
  2. The oldest account holder is 67 years old.
  3. The average customer age is 45 years.
  4. Customers have 20 years of work experience on average, with a maximum of 43 years.
  5. The maximum income is 224,000 and the minimum is 8,000, with an average of 73,000.
  6. ZIP code 94720 has the largest number of customers, with a count of 169 out of 5,000.
  7. Customers with a Family of 1 have the highest frequency count, 1,472.
  8. Average spending on credit cards per month (CCAvg): some customers don't have credit cards.
  9. Undergraduates hold the most accounts in the bank, with 2,096 customers out of 5,000.
  10. The average customer has a house mortgage value of 56,500.
  11. Most customers do not take a personal loan; the mean of Personal_Loan is only 0.096.
  12. 10.4% of the bank's customers hold a securities account.
  13. 6% of the bank's customers have a certificate of deposit (CD) account.
  14. 60% of customers use online banking facilities.
  15. 29.4% use another bank's credit card.

Finding the number of unique values of each variable

Exploratory Data Analysis

Univariate Analysis

AGE

Observation

  1. The mean and median of Age are close.
  2. The average age of customers is 46 years.

EXPERIENCE

Observation

  1. The mean and median work experience of customers are both 20 years.

INCOME

Observation
  1. Income is right-skewed.
  2. The maximum income is 224,000 and the minimum is 8,000, with an average of 73,000.
  3. Incomes of 180,000 and above are outliers (one-offs).
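
The skew and outlier observations above can be checked numerically. The sketch below is only an illustration: it assumes the box plot uses the common 1.5 x IQR whisker rule, and it runs on made-up lognormal incomes (in thousands of dollars) rather than the bank's data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical right-skewed incomes in thousands of dollars (not the real data).
income = rng.lognormal(mean=4.0, sigma=0.6, size=5000)

# A mean above the median is a quick sign of right skew.
mean_income = income.mean()
median_income = np.median(income)

# 1.5 * IQR rule: points beyond the whiskers are flagged as outliers.
q1, q3 = np.percentile(income, [25, 75])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr
upper_whisker = q3 + 1.5 * iqr
outliers = income[(income < lower_whisker) | (income > upper_whisker)]

print(f"mean {mean_income:.1f} vs median {median_income:.1f}")
print(f"upper whisker {upper_whisker:.1f}, outliers flagged: {outliers.size}")
```

On the real data, the same calculation on the Income column would give the cut-off above which incomes are drawn as outliers.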

ZIPCODE

Observation
  1. The bank's customers are concentrated in ZIP code areas with a mean of about 93000.
  2. ZIP codes around 90000, 94000 and 95000 have the highest frequency counts.

CCAvg

Observation

  1. CCAvg is right-skewed.
  2. The outliers are customers with average credit card spending of 5,000-10,000 a month.
  3. Most of the bank's customers have a CCAvg of 1,000-2,000 per month.

MORTGAGE

Observation
  1. The average customer has a house mortgage value of 56,500.
  2. Most customers don't have a mortgage.

PERSONAL LOAN

Observation
  1. 90.4% of the bank customers do not take personal loans
  2. 9.6% take personal loans from the bank

EDUCATION

Observation
  1. Undergraduates and Advanced/Professionals are the largest customer groups in the bank, with 41.9% and 30% respectively.
  2. Graduate customers are the smallest group.

Observation
  1. 89.6% of customers don't have a securities account.
  2. 10.4% of customers have a securities account.

Observation

  1. 94% of customers don't have a Certificate of Deposit account.
  2. 6% of customers have a Certificate of Deposit account.

Observation

  1. 59.7% use the bank's online banking application.
  2. 40.3% do not use the online banking application.

Observation

  1. 70.6% do not use another bank's credit card.
  2. 29.4% use another bank's credit card.

BIVARIATE ANALYSIS

AGE VS PERSONAL LOAN

EXPERIENCE VS PERSONAL LOAN

INCOME VS PERSONAL LOAN

Observation

  1. Customers who take personal loans have a median income of 140,000 USD, a maximum of 205,000 USD and a minimum of 60,000 USD.
  2. Customers who don't take personal loans have a lower median income of 60,000 USD.
  3. The customers with the highest incomes, who are outliers, do not take personal loans.

ZIPCode VS PERSONAL LOAN

Observation

  1. More of the customers who took personal loans are in ZIP codes between 94000 and 95000.

FAMILY VS PERSONAL LOAN

Observation

Customers with a family of 3 or 4 took the highest percentage of personal loans.

CCAvg VS PERSONAL LOAN

Observation

  1. Customers with higher average credit card spending (CCAvg) tend to take personal loans.
  2. Customers with high CCAvg who don't take personal loans are outliers.

EDUCATION VS PERSONAL LOAN

Observation

  1. Customers with Education 2 and 3 (Graduate and Advanced/Professional) have a higher chance of taking a personal loan.
  2. Customers with Education 1 (Undergraduate) are the least likely to take a personal loan.

Observation

  1. Customers with higher mortgages took personal loans.

A higher number of securities account holders took personal loans

9% of both securities account holders and non-holders took personal loans

Personal loan uptake among Certificate of Deposit account holders

Observation

  1. 46% of customers with a Certificate of Deposit account have personal loans, while only 7% of customers without one have personal loans.
  2. Across both CD_Account groups combined, only 9% of the bank's customers have personal loans.

Heat Map to check the correlation between variables

  1. CCAvg has a high correlation with Income (0.65).
  2. Personal Loan has a correlation of 0.5 with Income.

Model Building - Approach

  1. Data preparation
  2. Partition the data into train and test sets.
  3. Build a CART model on the train data.
  4. Tune the model and prune the tree, if required.
  5. Evaluate the model on the test set.
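
The steps above can be sketched roughly as follows. This is not the notebook's original code: the data is synthetic (generated with scikit-learn's `make_classification`), and the roughly 9% positive rate, the 70/30 split and the variable names are assumptions chosen to mirror the description.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Step 1 (stand-in): synthetic data with roughly 9% positives (assumption).
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=1,
)

# Step 2: partition into train and test sets, keeping the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Step 3: fit a CART-style tree (scikit-learn's DecisionTreeClassifier).
tree = DecisionTreeClassifier(criterion="gini", random_state=1)
tree.fit(X_train, y_train)

# Step 5: compare recall on train and test to spot overfitting.
train_recall = recall_score(y_train, tree.predict(X_train))
test_recall = recall_score(y_test, tree.predict(X_test))
print(f"train recall: {train_recall:.3f}, test recall: {test_recall:.3f}")
```

A large gap between train and test recall would suggest the unpruned tree is overfitting, which is what step 4 (tuning and pruning) is meant to address.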

Data Preparation

Split Data

Build Decision Tree Model

Observation

Only about 9% of the samples belong to the positive class, so a model that marks every sample as negative would still reach about 91% accuracy. Accuracy is therefore not a good metric for evaluation here.
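
A small worked example makes the point concrete. The labels below are made up to match the 9.6% loan rate reported earlier (480 of 5,000 customers); the "model" simply predicts the negative class for everyone.

```python
# Hypothetical labels: 5,000 customers, 480 (9.6%) of whom took the loan.
n_customers = 5000
n_positive = 480
y_true = [1] * n_positive + [0] * (n_customers - n_positive)

# A "model" that predicts the negative class for every customer.
y_pred = [0] * n_customers

correct = sum(t == p for t, p in zip(y_true, y_pred))
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

accuracy = correct / n_customers      # share of all predictions that are right
recall = true_positives / n_positive  # share of actual buyers that are found

print(f"accuracy: {accuracy:.3f}")  # accuracy: 0.904
print(f"recall:   {recall:.3f}")    # recall:   0.000
```

The accuracy looks strong even though not a single buyer is identified, which is why a metric such as recall is more informative for this problem.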

Insights:

Observation

Observation

Visualizing the Decision Tree

Observation

Test for overfitting

Using GridSearch for Hyperparameter tuning of our tree model

Recall has not improved on either the train or the test set after hyperparameter tuning, and we still do not have a generalized model.
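
For reference, a grid search that optimises recall could look like the sketch below. The parameter grid, the 5-fold cross-validation and the synthetic data are all assumptions made for illustration; the notebook's actual grid is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption): ~9% positive class.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=1,
)

# Hypothetical search grid; the values are illustrative only.
param_grid = {
    "max_depth": [3, 5, 7, None],
    "min_samples_leaf": [1, 5, 10],
    "criterion": ["gini", "entropy"],
}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=1),
    param_grid=param_grid,
    scoring=make_scorer(recall_score),  # rank candidates by recall, not accuracy
    cv=5,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best cross-validated recall: {search.best_score_:.3f}")
```

If the best cross-validated recall is no better than that of the untuned tree, the grid may simply not contain a better configuration, which is one reason to try cost complexity pruning next.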

Observation

  1. Income is the most important variable for determining which customers will take a personal loan.

Total impurity of leaves vs effective alphas of pruned tree


Minimal cost complexity pruning recursively finds the node with the "weakest link". The weakest link is characterized by an effective alpha, where the nodes with the smallest effective alpha are pruned first. To get an idea of what values of ccp_alpha could be appropriate, scikit-learn provides DecisionTreeClassifier.cost_complexity_pruning_path that returns the effective alphas and the corresponding total leaf impurities at each step of the pruning process. As alpha increases, more of the tree is pruned, which increases the total impurity of its leaves.

Next, we train a decision tree using the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the tree, clfs[-1], with one node.
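
The procedure described above can be sketched as follows, using `cost_complexity_pruning_path` and the `ccp_alpha` parameter of `DecisionTreeClassifier`. The data is synthetic and the split is an assumption; clipping alphas at zero is a defensive choice against floating-point round-off, not part of the method itself.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bank data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

# Effective alphas and total leaf impurity at each step of the pruning path.
base = DecisionTreeClassifier(random_state=1)
path = base.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Fit one tree per effective alpha; larger alphas prune more of the tree.
clfs = []
for alpha in ccp_alphas:
    alpha = max(float(alpha), 0.0)  # guard against tiny negative round-off
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    clf.fit(X_train, y_train)
    clfs.append(clf)

print("alphas tried:", len(ccp_alphas))
print("nodes in the unpruned tree:", clfs[0].tree_.node_count)
print("nodes in the most heavily pruned tree:", clfs[-1].tree_.node_count)
```

From here, one would typically score each tree in `clfs` on the train and test sets (for example with recall) and keep the alpha that gives the best test score.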

Observation

Observation

Observation

Final result of decision tree Model Table

The decision tree model with post-pruning has given the best recall score on the data

Table and EDA Result for false prediction

Observation:

LOGISTIC REGRESSION

Dealing with outliers

DATA BUILDING

DATA SPLIT

Observation:

  1. The recall on the training and test sets is closely aligned, but neither is close to 1.

AUC ROC curve

Optimal threshold

Observation

After applying the optimal threshold, true positives increased from 6% to 9% and false negatives decreased from 90% to 29%.
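
One common way to pick such a threshold is to take the point on the ROC curve where the true positive rate minus the false positive rate is largest. Whether the notebook used exactly this rule is an assumption; the sketch below shows the idea on synthetic data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data (assumption): ~9% positive class.
X, y = make_classification(
    n_samples=5000, n_features=10, n_informative=5,
    weights=[0.91, 0.09], random_state=1,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# ROC curve on the training data; choose the threshold maximising TPR - FPR.
train_proba = model.predict_proba(X_train)[:, 1]
fpr, tpr, thresholds = roc_curve(y_train, train_proba)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# Compare recall on the test set at the default and the chosen threshold.
test_proba = model.predict_proba(X_test)[:, 1]
recall_default = recall_score(y_test, (test_proba >= 0.5).astype(int))
recall_optimal = recall_score(y_test, (test_proba >= optimal_threshold).astype(int))

print(f"optimal threshold: {optimal_threshold:.3f}")
print(f"test recall at 0.5: {recall_default:.3f}")
print(f"test recall at optimal threshold: {recall_optimal:.3f}")
```

With an imbalanced target the chosen threshold usually falls well below 0.5, which trades some precision for fewer missed buyers.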

Let us ensure that multicollinearity doesn't exist

Observation

Observation

Income and ZIPCode have high VIF. ZIPCode will be dropped because it has the highest VIF.

All the variables now have a VIF below 5.
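
For context, the variance inflation factor of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that feature on all the other features. The notebook's own VIF code is not shown, so the sketch below is only an illustration on made-up data: it uses the identity that, with an intercept in each auxiliary regression, the VIFs are the diagonal of the inverse correlation matrix. The feature names and the strength of the correlation are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Hypothetical features: "ccavg" is built to be correlated with "income".
income = rng.normal(size=n)
ccavg = 0.65 * income + rng.normal(scale=0.76, size=n)
age = rng.normal(size=n)
features = {"income": income, "ccavg": ccavg, "age": age}

X = np.column_stack(list(features.values()))

# VIF_j = 1 / (1 - R_j^2) equals the j-th diagonal entry of the inverse
# of the feature correlation matrix.
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(features, vif):
    print(f"{name}: VIF = {value:.2f}")
```

A VIF near 1 means a feature is almost uncorrelated with the rest; values above roughly 5 are the usual signal to drop or combine features, which is the rule applied above.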

Calculate the odds ratio from the coefficients using the formula odds ratio = exp(coef)

Calculate the probability from the odds ratio using the formula probability = odds / (1+odds)
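
The two formulas above can be applied directly. In this sketch the coefficient values are hypothetical, chosen only to show the arithmetic; they are not the fitted model's coefficients, and the helper names are illustrative.

```python
import math


def odds_ratio(coef: float) -> float:
    """odds ratio = exp(coef)"""
    return math.exp(coef)


def probability_from_odds(odds: float) -> float:
    """probability = odds / (1 + odds)"""
    return odds / (1 + odds)


# Hypothetical coefficients, for illustration only.
coefficients = {"Income": 0.05, "Education_2": 3.5, "Online": -0.6}

for name, coef in coefficients.items():
    odds = odds_ratio(coef)
    prob = probability_from_odds(odds)
    print(f"{name}: coef={coef:+.2f}, odds ratio={odds:.3f}, probability={prob:.3f}")
```

A coefficient of 0 gives an odds ratio of 1 and a probability of 0.5; positive coefficients push both above those values and negative coefficients push them below.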

Observation

Let's look at the most significant variables

Education_3 and Education_2 are the most significant variables for prediction

Prediction of the model

Prediction on Train data

Prediction on Test data

Choosing Optimal threshold

Positive Prediction Final result of Logistic regression table

Table and EDA Result for false prediction

Observation

Recommendation

Based on my observations, and having chosen the Decision Tree model as the best model for prediction, here are my recommendations: